Characterization and comparative analysis of the complete plastid genomes of four Astragalus species

Astragalus is the largest flowering plant genus. We assembled the plastid genomes of four Astragalus species (Astragalus iranicus, A. macropelmatus, A. mesoleios, A. odoratus) using next-generation sequencing and analyzed their plastomes including genome organization, codon usage, nucleotide diversity, prediction of RNA editing and etc. The total length of the newly sequenced Astragalus plastomes ranged from 121,050 bp to 123,622 bp, with 110 genes comprising 76 protein-coding genes, 30 transfer RNA (tRNA) genes and four ribosome RNA (rRNA) genes. Comparative analysis of the chloroplast genomes of Astragalus revealed several hypervariable regions comprising three non-coding sites (trnQ(UUG)–accD, rps7 –trnV(GAC) and trnR(ACG)–trnN(GUU)) and four protein-coding genes (ycf1, ycf2, accD and clpP), which have potential as molecular markers. Positive selection signatures were found in five genes in Astragalus species including rps11, rps15, accD, clpP and ycf1. The newly sequenced species, A. macropelmatus, has an approximately 13-kb inversion in IR region. Phylogenetic analysis based on 75 protein-coding gene sequences confirmed that Astragalus form a monophyletic clade within the tribe Galegeae and Oxytropis is sister group to the Coluteoid clade. The results of this study may helpful in elucidating the chloroplast genome structure, understanding the evolutionary dynamics at genus Astragalus and IRLC levels and investigating the phylogenetic relationships. Moreover, the newly plastid genomes sequenced have been increased the plastome data resources on Astragalus that can be useful in further phylogenomic studies.


Introduction
The chloroplast is a semi-autonomous organelle that plays important roles in photosynthesis, carbon fixation, and fatty acids, starch, and amino acids synthesis [1]. The chloroplast genomes (plastomes) of most angiosperms have a circular quadripartite structure composed of two identical copies of inverted repeat regions (IRa and IRb) that divide the rest of the plastome into LSC (large single-copy) and SSC (small single-copy) region [2]. There are about 80 protein-coding regions, four rRNA genes, and 30 tRNA genes in the plastome of land plants a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 with the average size of 151 kb. Due to their slower evolution than nuclear genomes, absence of recombination, and prevalence of uniparental inheritance, plastid genome sequences have been shown to be useful molecular resources for phylogenetic analyses and elucidating the genetic relationships among taxa [3]. In terms of structural organization, gene/intron content, and gene order, the plastid genomes of photosynthetic green plants are generally highly conserved [2,4,5]. However, extensive rearrangements of plastomes such as IR lost have been reported in some lineages including Geraniaceae [6], Orobanchaceae [7][8][9] and Fabaceae [10,11]. Fabaceae, the third-largest angiosperm family, includes a large clade of over 4000 species, 52 genera and nine tribes known as IRLC (Inverted Repeat Lacking Clade), which are distinguished by the absence of a single copy of the IR region [10,[12][13][14]. The plastomes of the IRLC have undergone many rearrangements, including numerous gene/intron losses [15][16][17], sequence inversions [18,19], gene transfers to the nucleus [16,20] and the second independent IR gain in some lineages [19,21]. The two main explanations presented for genomic instability in the IRLC are the absence of the IR and repeat-mediated recombination [18,[21][22][23]. Astragalus, the largest genus in flowering plants and legumes, belongs to the tribe Galegeae in the IRLC. This genus contains approximately 2500-3000 species distributed on all continents except Australia, primarily in the northern hemisphere's and South America's cool and arid continental regions [24,25]. In recent years, due to the rapid advancement of next-generation sequencing (NGS) technology, plastid genomes of about 38 Astragalus species (including 25 species of Neo-Astragalus (the New World aneuploid species) [24,26] and the rest of species belong to other clades) have been deposited in NCBI (the National Center for Biotechnology Information). In the previous study, the connection between repeat structure and plastome variation among the New World Astragalus species was investigated [27]. Accordingly, Astragalus may be an appropriate group to investigate how the plastid genome structure and content have changed on a fine scale throughout evolution. On the other hand, other previous studies [28,29] only sampled a small number of Astragalus species in comparative analyses.
In this study, we sequenced the chloroplast genomes of four Astragalus species (Astragalus iranicus Bunge, A. macropelmatus Bunge, A. mesoleios Boiss. et Hohen. and A. odoratus Lam.) and performed detailed comparative genomic analyses with previously reported plastid genomes of Astragalus as well as the rest of the IRLC plastomes. We aimed to 1) recognize the plastid genome structure, gene order and gene content of Astragalus species 2) investigate the origin, pattern, evolution and phylogenetic utility of plastid genome rearrangements in Astragalus 3) assess the effectiveness of complete chloroplast genome sequences in phylogenetic studies and 4) screen the highly informative regions of Astragalus plastomes for future Sanger based studies.

Extraction and sequencing of plastid DNA
The young leaves of three Astragalus species (A. iranicus, A. macropelmatus, A. mesoleios) were collected from the southern and northern slopes of Alborz mountain chain in Tehran, Iran. A. odoratus was collected from an orchard, in Urmia, W. Azerbaijan, Iran. They were identified by Prof. S. Kazempour-Osaloo. These species were kept in the Tarbiat Modares University Herbarium (TMUH) (voucher code: 2016-2, 2016-3, 2016-4 and 2016-5, respectively). No permit was required to take the samples that were not on the list of national key protected plants. The fresh leaves were quickly dried with silica gel before DNA extraction. Our experimental research, composing the plant materials collection, follows international, national and institutional guidelines. Genomic DNA was extracted from dried leaves using a DNeasy Plant Kit (Qiagen) based on the manufacturer's instructions. Illumina miSeq-550 platform was used for DNA sequencing at Arizona State University after its quality and quantity were assessed using 1% agarose gel electrophoresis. The paired-end libraries were constructed in accordance with the manufacturer's instructions (Illumina Inc., San Diego, CA).

Genome assembly and annotation
FastQC [30] was used to compare the quality of the generated short read data across species. The generated sequencing data was used to de novo assemble plastid genomes with Velvet v.1.2.10 [31] by generating contigs with different kmer values. To confirm the Velvet assemblies, NOVOPlasty [32], as another assembly method, was performed for each Astragalus species and matK sequence of Astragalus nakaianus (KR296789) was used as seed. The newly assembled plastomes were annotated using GeSeq [33]. We improved the identification of tRNAs by using the on-line tRNAscan-SE service [34]. Raw read data were remapped to the assembled plastid genomes with Bowtie2 [35] (implemented in Geneious v.9.0.2 (https://www. geneious.com)) to specify the number of matched reads and to assess the depth of coverage. The entire plastome sequences of A. iranicus, A. macropelmatus, A. mesoleios and A. odoratus were deposited in GenBank.
The absence of IRa in the Astragalus species was confirmed using PCR and Sanger sequencing. A PCR method was used to determine whether or not the IRa region was present using diagnostic primer pairs. The primer pairs were designed to detect the presence or absence of the IRa region in either the conserved protein coding sequences ndhF and psbA or the rps19 and rpl2 protein coding regions that surround the IR region borders. In the present study, the following primer pairs were used: ndhF-F (5 0 -TATATGATTGGTCATATAATCG-3 0 ) [36] and psbA-R (5 0 -GTTATGCATGAACGTAATGCTC-3 0 ) [37]; rps19-F (5 0 -GTTCTGGACCAAGTTATT-3 0 ) [36] and rpl2-R (5 0 -ATTTGATTCTTCGTCGAC-3 0 ) [38]. The PCR amplification program implemented in this study was completely similar to the program used in the article by Moghaddam et al. (2022) [38].
In addition, the presence/absence of the inversion observed in A. macropelmatus was surveyed using PCR and Sanger sequencing in this species. In this regard, pairs of primers flanking the endpoints of inversion were designed in either ycf1 and ndhB or trnL(CAA) and ndhB conserved protein-coding sequences. The following primer pairs were used to assess the presence/absence of the inversion: ycf1-F (5'-CAATAGATAATGTGGTCAGA-3') and ndhB-R (5'-ACCCAAACAAGTATGAAACG-3'); and trnL(CAA)-R (5'-ACCATTTCACCACCAAGGC-3') and ndhB-F (5'-ACCCAAACAAGTATGAAACG-3') (designed in this study). Amplification condition was performed in a 20 μl reaction volume and consisting of 8.5 μl deionized water, 10 μl Tag red master mix (Amplicon), 0,5 μl forward primer, 0,5 μl reverse primer and 1 μL template. Mixture solution was amplified by PCR machine (Biorad). Thermal cycle programmed for 3 min at 95˚C as initial denaturation, followed by 37 cycles of 60 sec at 95˚C for denaturation, 80 sec at 53˚C (when using ycf1-F and ndhB-R primers) and 45 sec at 56˚C (when using ndhB-F and trnL(CAA)-R primers) as annealing, 70 sec at 72˚C for extension, and final extension at 72˚C for 7 min. PCR products were examined by electrophoresis at 100 V for 30 minutes in a 1% (w/v) agarose gel in 1 x TAE buffer. Electrophoresis gel was soaked in ethidium bromide for 30 minutes then visualized in UV light.

Codon usage analysis
Codon usage analysis was conducted using the Bioinformatics web server (https://www. bioinformatics.org/sms2/codon_usage.html). MEGA11 [39] was also used to determine the relative synonymous codon usage (RSCU) values, which were used to show the characteristics of the variation in synonymous codon usage.

Identification of divergent hotspots and analysis of synonymous (Ks) and non-synonymous (Ka) substitution rates
The whole chloroplast genome sequences were aligned using MAFFT [41] on XSEDE v.7.402 in CIPRES Science Gateway [42] to determine nucleotide diversity (Pi) among the plastomes of the four newly sequenced Astragalus species as well as some Astragalus representative species. Using the DnaSP v.6.12 software [43], a sliding window analysis was performed to show the nucleotide diversity of the plastid genome. The window length was set to 800 bp and the step size was 200 bp. Moreover, the protein coding regions of the 20 plastomes were used to assess evolutionary rate differences within the Astragalus. Thus, we used MAFFT to align the 76 protein-coding regions separately, and then implemented DnaSP v.6.12 software to estimate the synonymous (Ks) and non-synonymous (Ka) substitution rates, as well as their ratio (Ka/ Ks).

Genome comparison
To consider divergence in plastid genomes, identity across entire plastomes was visualized using the mVISTA viewer in the Shuffle-LAGAN mode [44] among 20 Astragalus accessions, with Oxytropis bicolor (accession number: MN255323) as the reference.

Potential RNA editing sites prediction
The Predictive RNA Editor for Plants (PERP)-Cp web server (http://prep.unl.edu) [45] (with a cutoff value of 0.8), was used to predict potential RNA editing sites in Thirty-five protein coding genes of Astragalus species.

Phylogenetic reconstruction
Seventy-five protein-coding genes were identified from 53 species within the IRLC and two outgroups [Robinia pseudoacacia L. and Lotus japonicus (Regel) K.Larsen]. This study obtained the whole plastid genomes of A. iranicus, A. macropelmatus, A. mesoleios, and A. odoratus, as well as other plastomes downloaded from GenBank (S1 Table). Maximum likelihood and Bayesian inference methods were used to analyze the concatenated data. Prior to performing maximum likelihood and Bayesian analyses, a general time reversible and gamma distribution (GTR + G) model was chosen using the MrModeltest2.2 [46] under Akaike Information Criteria (AIC) [47]. Maximum likelihood analyses were implemented with the online phylogenetic software W-IQ-TREE [48] available at http://iqtree.cibiv.univie.ac.at. Rapid bootstrap analyses (with 5000 replicates) were used to calculate node supports. MrBayes v.3.2 in the CIPRES [42] was used to perform Bayesian inference with the following parameters: Markov chain Monte Carlo (MCMC) simulations with four incrementally heated chains for 10,000,000 generations, starting from random trees and sampling one out of every 1,000 generations. The first 25% of the trees were considered to be burn-ins. The remaining trees were used to make a consensus tree with a 50% majority rule and to estimate posterior probabilities. Posterior probabilities (PP) greater than 0.95 were considered as main support for a clade.

Characteristics of the newly sequenced Astragalus plastomes
The Illumina miSeq-550 system produced 2,937,124 paired-end raw reads for A. iranicus species and 10,549,829 for A. macropelmatus. The plastid genomes assembled with Velvet and NOVOPlasty were the same. The lengths of the four newly sequenced whole chloroplast genomes ranged from 121050 to 123,622 bp ( Table 1). All of the newly sequenced Astragalus chloroplast genomes showed the typical IRLC structure with having a single copy of the IR region. In this regard, the lack of infA, rps16 and rpl22 genes and the first clpP intron in the plastid genomes of four Astragalus species are noted; these regions, found in other angiosperms, are absent from the plastomes of all the IRLC taxa [15,20,49]. There were 110 genes in the four Astragalus plastid genomes, including 76 protein-coding genes, 30 transfer RNA (tRNA) genes, and four ribosome RNA (rRNA) genes ( Fig 1A, 1B, Table 2). The LSC (79,613-81,267 bp), SSC (12,614-13,758 bp), and IR (28,259-29,517 bp) regions, as well as 110 gene locations in the plastome are shown in Fig 1A, 1B. Similar genome structure and gene order were reported in newly sequenced Astragalus species, which is consistent with the plastomes of the Astragalus species studied so far. In four Astragalus species plastid genomes, 16 genes have one intron, whereas ycf3 has two introns (S2 Table). The rps12 gene is a trans-splicing gene with no introns at the 3' end. The intron of trnK(UUU) is the largest (2,530-2,567 bp) encompassing the matK gene, whereas the smallest intron (539-551 bp) belong to trnL(UAA). Although the overall gene content and arrangement within Astragalus plastid genomes is Table 1. Plastid genome information of selective Astragalus species and the four newly assembled Astragalus species. highly similar (Table 2), there are some structural differences in some species. Astragalus macropelmatus has an approximately 13-kb inversion (ndhB~trnN(GUU)) that placed trnN (GUU) next to trnL(CAA) and ndhB adjacent to ycf1 ( Fig 1B). PCR and Sanger sequencing were used to confirm the presence of the inversion in the plasome of A. macropelmatus. Two diagnostic primer pairs were designed to confirm the presence of this inversion in A. macropelmatus and to screen other Astragalus species for the presence/absence of this 13-kb inversion. PCR amplification from primers in the trnL(CAA) and ndhB protein-coding regions was expected only in taxa without the inversion, whereas PCR amplification from primers in the ycf1 and ndhB genes was expected only in taxa with the inversion. In our study, Sanger sequencing results agree with the presence of 13-kb inversion in the A. macropelmatus chloroplast genome (accession numbers: LC764454 and LC764455). The overall guanine-cytosine (GC) content of the new sequenced Astragalus plastid genomes ranged from 34.1% to 34.4% (Table 2).

Size (bp) LSC (bp) GC (%) (LSC) SSC (bp) GC (%) (SSC) IR (bp) GC (%) (IR) GC (%) Total
In the present study, other species of Astragalus whose plastomes have been sequenced to date were selected as representatives and their genome structure were compared (Table 1). Accordingly, all of the representative species of Astragalus have one copy of the IR. Some previous studies [28,50,51] have shown that A. galactites and A. laxmannii have two IR regions. In this study, two species were re-annotated and it was found that both species, like other taxa belonging to the IRLC, only have one IR region. There seems to be some rearrangements (i.e., increased repeat content) in the plastome of A. galactites. All plastid genomes of Astragalus species showed the typical structure of the IRLC composed of LSC (79,613 to 83,955), SSC (12,614 to 15,231) and one inverted repeat (26,997 to 29,517) regions ( Table 1). The longest and the shortest plastome size belonged to the A. galactites (126,117 bp) and A. iranicus (121,050 bp), respectively ( Table 1). Plastomes of the all Astragalus species are extremely conserved with respect to gene content and order. The GC contents of the LSC (33.1% to 33.6%) and SSC (29.8% to 31.1%) regions in studied species were lower than those of the IR (37.9% to 38.5%) regions. The highest guanine-cytosine (GC) content (34.4%) was found in the A. iranicus plastome, while the lowest (33.9%) was found in the A. galactites plastome (Table 1).

Codon usage bias
We examined the codon distribution in the protein-coding regions of the four newly sequenced Astragalus species and compared them with other representative species of Astragalus. These protein-coding genes showed a total of 24,871 codons in A. macropelmatus, 24,736 codons in A. odoratus, 24,675 codons in A. iranicus and 24,718 codons in A. mesoleios. These codons belonged to 61 different types of codons, which encoded 20 amino acids. The most frequent amino acid was phenylalanine, and the most frequent codon was TTT. The codons of the amino acid Arginine were found to be the least abundant in the plastomes of four Astragalus species (S3 Table). Furthermore, only one codon was found to code the amino acids methionine (ATG) and tryptophan (TGC). The most prevalent start codon is the ATG codon. In four Astragalus species, however, we found ACG as the initiator codon in the ndhD gene. The plastomes of Astragalus representative species were examined for codon usage frequency based on protein-coding gene sequences and relative synonymous codon usage (RSCU). The RSCU value, which represents non-uniformity in codon usage, was calculated by dividing the actual observed values of the codon by the theoretical expectations. A value larger than 1.0 indicates that a positive codon usage bias exists for a codon, while a value less than 1.0 indicates that a negative codon usage bias exists for a codon. When the RSCU value is 1.0, there is no codon usage bias [52]. The total number of codons in the Astragalus species varies from 21,470 codons in A. arrectus (as the smallest codon number) to 24,871 codons in A. macropelmatus. Methionine (AUG) and tryptophan (UGG) with RSCU = 1 had no bias. Meanwhile, the greatest RSCU value was recorded for UUA and AGA that encode leucine and serine amino acids respectively and the lowest belonged CUG that encode leucine (S4 Table). Therefore, UUA and AGA were positively biased while CUG was negatively biased. Furthermore, leucine exhibited A or T (U) bias in all synonymous codons (UUA, UUG, CUU, CUC, CUA, and CUG). Except for the UUG codon, all biased relative synonymous codons (RSCU > 1) ended with an A or U. Furthermore, the majority of codons ending in C or G have an RSCU value less than one (S4 Table). High A/U preference, a common phenomenon in higher plant plastomes [53], was observed in the third codon of Astragalus species. The RSCU values of the plastomes are a useful source of evolutionary traits resulting from selection and mutation which are essential for investigating organism evolution [54,55].

Repeat structure and simple sequence repeats
Repetitive motifs play an important role in computing repeat, deletion, and rearrangement events in the chloroplast genome [55]. Repeat analysis of four newly sequenced Astragalus plastomes detected 42 (A. mesoleios) to 50 (A. iranicus, A. macropelmatus, A. odratus) repeat structures ranging in length from 30 to 434 bp. In this study, A. iranicus reported 50 long repeats comprised of 32 forward (F), 15 palindromic (P), two reverse (R) and one complement (C) repeats, A. macropelmatus recorded 50 long repeats composed of 38 F and 12 P repeats, A. mesoleios showed 42 long repeats included 23 F, 16 P, two R and one C repeats and A. odoratus demonstrated 50 long repeats consisted of equal numbers of F and P (23) and four R repeats (S5 Table). The most abundant type of repeats was the forward, with lengths ranging from 23 to 38 bp in all four species (Fig 2). These repeats could be useful in investigating the population genetic of these four Astragalus species.
The SSR distribution patterns in plastid genome of the Astragalus species included in this study were all similar. In these Astragalus species, the total number of SSRs ranged from 120 (A. nakaianus) to 80 (A. gypsodes) (Fig 3A, Table 4). In each of the 20 representative species, mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs were observed, with the average percentages of mono-, di-, tri-, tetranucleotide SSRs were 51.48%, 25.32%, 9.36%, and 12.04%, respectively. In all cp genomes, we found penta-and hexanucleotide motifs to be very rare ( Fig  3B, Table 4). The majority of SSRs were AT-rich and rarely contain CG. These results are consistent with the observed Leguminosae species [28,29,57]. These SSR loci are located primarily in the LSC region as compared to the SSC and the IR regions.

Nucleotide substitution rates
Using DnaSP v.6.12, we compared the non-synonymous (Ka) to synonymous (Ks) substitution ratio (denoted as ω) for 76 protein-coding genes between the newly sequenced taxa and other selective Astragalus species (S7 Table). The Ka/Ks ratio is frequently used to assess the natural selection pressure and evolution rate of nucleotides, which is an important marker in investigation

PLOS ONE
Complete plastid genomes of four Astragalus species of species evolution. Accordingly, ω is an indicator of adaptive evolution or positive selection. Neutral evolution is signified by a ω value of 1, ω < 1 indicates purifying (negative) selection, and ω > 1 indicates that the gene is under the positive (adaptive) selection [58]. Most of the 75 protein-coding genes had a low ω value (less than 0.9), inferring that most of these genes were affected by purifying selection during the evolution. The highest nonsynonymous and synonymous rates were found in the clpP gene, which encodes a caseinolytic peptidase participating throughout plastid protein metabolism (Ka = 0.107069, Ks = 0.078252). In this study, the ω value was estimated to be 0 for seven genes in the LSC/IR region (rps12, psbL, psbT, psbN, psbH, petN, petG) (S7 Table). This occurred as a result of the Ka or Ks being 0 or extremely low, thus ω could not be calculated [57]. The most rapidly evolving genes in Astragalus species which indicates positive selection, were rps11, rps15, accD, clpP and ycf1. These results suggested that the chloroplast genes in different Astragalus species may have been underwent to different selection pressures.

Genomic divergence
To assess genomic divergence, mVISTA sequence identity analysis [44] was performed on the 15 Astragalus species, Oxytropis bicolor was used as a reference. We observed lower divergence in the IR region and protein-coding sequences than in non-coding regions which also occurred in almost higher plants (S1 Fig). High nucleotide variations were found among Astragalus species for the protein-coding genes ycf1, ycf2, accD and clpP as well as intergenic regions such as trnQ(UUG)-accD, rps7 -trnV(GAC) and trnR(ACG)-trnN(GUU). These divergence hotspot regions provided valuable data for the development of molecular markers for Astragalus species identification, population genetics and phylogenetic analyses. The same regions were found in the plastid genomes of the Astragalus species using sliding window analysis. Nucleotide variability (Pi) was calculated using the DnaSP software to estimate the sequence divergence level. The average value of Pi among the 20 chloroplast genomes of Astragalus species was calculated to be 0.02933 (Fig 4). Seven regions (accD, clpP, ycf1 and ycf2 as protein-coding regions and trnK(UUU)-rbcL, rps7-trnV(GAC) and trnR(ACG)-trnN(GUU) as intergenic regions) demonstrated high nucleotide variability, with Pi values > 0.05 (Fig 4) and were located in the LSC and IR regions. mVISTA also obtained similar results. These are rapid evolutionary change regions in the chloroplast genomes that may be useful for the population genetics, phylogenetic reconstruction and development of molecular markers.

Prediction of RNA editing sites
RNA editing is a post-transcriptional modification process in higher plant chloroplasts that converts cytidine (C) to uridine (U) or U to C at peculiar sites within RNA. Prep-CP prediction tool was used to predict RNA editing sites of plastid genes in 20 species of Astragalus (S8 Table). In total, an average of 40 editing sites were exist in 15-17 plastid protein-coding genes for each species, all of which were C-to-U (cytidine to uridine) conversions (S8 Table). ndh genes had the most editing sites, with a total of 18 for each species, followed by the rpoB gene, which had seven editing sites. Three editing sites were detected in accD gene. There were also two editing sites in ccsA, matK and rpoC1 genes. There was only one editing site found for each of the remaining genes. Our study showed that in Astragalus species, the probability of editing ndh genes is higher than other genes at mRNA level.

Phylogenetic relationship analysis
To identify the phylogenetic positions and relationships of the newly sequenced Astragalus species (A. iranicus, A. macropelmatus, A. mesoleios and A. odoratus), Bayesian inference (BI) and maximum likelihood (ML) methods of phylogenetic analysis were implemented based on 75 protein-coding region datasets from 49 plant taxa from different tribes of the IRLC, with Lotus japonicus and Robinia peudoacacia used as outgroups. The phylogenetic topologies of the ML and BI trees were similar, with high support values (Fig 5). Therefore, only the ML tree is shown in Fig 5. The phylogenetic tree can be defined into five clades: clade I comprises tribes Wisterieae and Glycyrrhizeae which were sister to the entire of the IRLC, clade II contains tribes Cicereae, Trifolieae and Fabeae and genus Galega (from paraphyletic tribe Galegeae), clade III consists of monophyletic tribes Caraganeae and Hedysareae, clade IV includes tribe Coluteae and genus Oxytropis and clade V comprises monophyletic genus Astragalus. 22 species of Astragalus form a well-supported clade that include four subclades. All the Astragalus subclades were fully supported in the phylogenetic tree. A. macropelmatus as basal branch along with A. mongholicus, A. membranaceus and A. nakaianus formed subclade A. A. odoratus, A. canadensis and A. bhotanensis constituted the subclade B. Next subclade (C) comprised A. iranicus, A. strictus, A. laxmannii, A. galactites and A. scaberrimus. In subclade D, A. mesoleios was the first diverging lineage and was sister to the remaining species, which are Neo-Astragalus species.

Plastid genome organization and gene content
In this study, we used the Illumina platform to sequence four Astragalus chloroplast genomes (A. iranicus, A. macropelmatus, A. mesoleios, and A. odoratus) and compared them to the other published plastomes of the same genus available on GenBank. Our assembly results showed that the plastome structures of the four species were similar to the other Astragalus species and the lengths of their plastid genomes ranged from 121,050 bp to 123,622 bp ( Fig  1A). All the plastomes, like other members of the IRLC, have lost single copy of the IR [10,59]. Although the overall genome structure and gene content within the four newly sequenced

PLOS ONE
Astragalus is highly similar and conserved, a structural change is detected in A. macropelmatus plastome. The cp genome of A. macropelmatus has experienced a distinctive 13-Kbp inversion in the IR region that has not been observed in any of the legume plastomes sequenced to date (Fig 1B). In all the legumes, ndhB gene is located next to the trnL(CAA) gene, while in A. macropelmatus, ndhB gene has been rearranged, inverted and located next to the ycf1 gene. The structure of the plastid genome is highly conserved in most photosynthetic angiosperm lineages, except Campanulaceae, Fabaceae and Geraniaceae that represent extensive rearrangements [20,60]. The most significant rearrangements within the Fabaceae is inversion that has been occurred in several lineages such as in most of the papilionoids with 50-kb inversion [12,15], in Tylosema esculentum, one of the basal most legumes, with a unique inverted region of six genes rbcL, accD, psaI, ycf4, cemA and petA [61], in plastome of Onobrychis viciifolia with ycf2/trnI (CAU)/trnL(CAA) inverted genes [38] and in plastid genome of Gueldenstaedtia verna with four inverted regions (trnK-psbK, accD-rpl23, rps15 -trnL, trnL-trnI) [62]. Furthermore, in the recent study of Charboneau et al. (2021) [27], several inversions were shown in the plastome of four species of the New World Astragalus. Astragalus calycosus has~7-kb inversion between rbcL and trnH(GUG); there is a~40-kb inversion between trnQ(UUG) and trnT(UGU) in A. mollissimus; and A. flexuosus and A. neglectus have experienced~7-kb inversion between trnL (CAA) and trnI(CAU). Various mechanisms are involved in the occurrence of inversion in the genome, one of the most significant of which is the existence of the short-inverted repeats adjacent the endpoints of these regions [27,63]. High level of repeat content and count (specially near the ends of inversions) has been found in taxa with high levels of genomic rearrangements [6,18,27] but there may or may not be a causative connection between the two correlated variables. Due to relative rarity and ease of determination of inversions, these regions are extremely valuable and useful in phylogenetic studies [59]. Other mechanism for the occurrence of inversions is intramolecular recombination between repeat elements and tRNA genes [27,38].
Repetitive elements can also cause other genomic rearrangements such as gene/intron loss, pseudogenization and a second independent IR region gain (for the IRLC taxa). IR reemergence has been reported in some IRLC taxa in the recent studies (e.g. Medicago minima [19,64] and Melilotus dentata [21]). In this regard, IR reemergence in two Astragalus species (A. galactites and A. laxmannii) has been erroneously reported [28,50,51]. In this study, the complete plastid genomes of the both species were re-annotated and it was found that, like other members of the IRLC, they have only one IR region.
In addition to the genus Astragalus, the plastome of some other genera of the IRLC have also undergone different genomic rearrangements. For example, the lack of accD gene in some species of Trifolium [18,20,65], the absence of the second clpP intron (except intron 1 of clpP gene which is absent at the base of the IRLC) [15]) in Glycyrrhiza glabra, G. lepidota, Tibetia liangshanensis and seven species of Neo-Astragalus [20,27,66] and loss of rpl23, rpl33 and ycf4 genes in some Lathyrus, Pisum and Vicia species [16,17,29]. The occurrence of numerous rearrangements in the plastid genome of the IRLC taxa has made it a great model for studying plastome evolution. The occurrence of different rearrangements in the IRLC might be consequence of presence of many tandemly repeated sequences, the lack of IR and variability in IR region size [19,66,67].
The plastomes among Astragalus species were identical in guanine-cytosine content, but the GC% in the LSC and SSC regions were remarkably lower than IR region because of the presence of rRNA genes (rrn23, rrn16, rrn5, rrn4.5) with high GC content (50%-56.4%) [38,57,68]. GC content may be the most significant factor related to the phenomenon of codon usage bias among different organisms. Relative synonymous codon usage (RSCU) patterns are similar among Astragalus species and could provide useful reference for phylogenetic relationship analysis [53,54].
Repeat sequences play important roles in the evolution and rearrangements of the plastome and can be used to develop genetic markers for population and phylogenetic studies [53]. Four types of repeats were found in plastid genome of Astragalus species using REPuter software. In the majority of the studied Astragalus species, forward dispersed repeats were found to be the most abundant, followed by palindromic and reverse repeats, and the least complement. Moreover, repeat sequences were mostly dispersed in non-coding regions of the Astragalus species. The presence of these repeat sequences indicates the loci which could be significant hotspots for plastid genome reconfiguration [38,66]. In the studied Astragalus species, the most abundant observed motifs were mononucleotides and A/T repeats were the most frequent but no G/C motif found in their cp genomes. Strong A/T preference in SSR loci has been observed in many legume [38,57,68,69] and non-legume [70,71] species which may contribute to the bias in base composition [68]. SSRs were distributed across the plastome, with the highest frequency in the LSC region which may be related to the lack of single copy of the IR region in the IRLC taxa. SSRs can exhibit high genetic polymorphism and mutation rates and are frequently used for the development of molecular markers and play a crucial role in the recombination and rearrangement of genome, population genetics, gene mapping and identification of species [72].
Highly variable DNA markers are useful for identifying closely related species and provide abundant information for broad-scale phylogenetic analyses. In this study based on mVISTA and sliding window analysis, accD, clpP, ycf1 and ycf2 as protein-coding regions and trnK (UUU)-rbcL, rps7-trnV(GAC) and trnR (ACG)-trnN(GUU) as intergenic regions, which showed some extent divergence, were detected with higher Pi values (Fig 4) and have potential to be used as DNA markers. Some of these regions have been used as molecular markers in previous Astragalus phylogenetics analyses [73][74][75]. The ycf1 is a more variable gene than matK and is suitable for lower taxonomic levels in DNA barcoding and molecular systematics [3,76], as well as, the clpP gene codes a caseinolytic peptidase located in the LSC region and demonstrated accelerated mutation in the IRLC [75]. Further studies are needed to assess whether these variable regions can be served in Astragalus phylogenetic analyses or use as great candidate markers for species authentication and population genetic.

Plastid RNA editing prediction and positive selection analysis
RNA editing is a type of post transcriptional modification, which involves the insertion, deletion or conversion of cytidine (C) to uridine (U) nucleic acid bases in the chloroplasts of higher plants [77]. The present study found that the ndh genes had the most editing sites in the plastid genomes of Astragalus species. Also, the ndh group genes have shown the most chloroplast editing sites in flowering plants [77,78]. The plastid ndh genes, which encode components of the thylakoid NDH complex, have either been lost or pseudogenized in various species of algae, bryophytes, pteridophytes, gymnosperms, angiosperms [78][79][80]. Some studies have mentioned that the products of the ndh genes might be unessential for plants growth under normal conditions [77]. RNA editing is crucial for the function of the NDH protein complex as well as for improving plant photosynthesis under adverse conditions [77].
We estimated the Ka/Ks for each gene in DnaSP v.6.12 to assess the selective pressure on protein-coding sequences on Astragalus plastomes. The Ka/Ks ratio is a common method for studying adaptive evolution or positive selection in plant species. In our evaluation, the Ka/Ks ratio reveals positive selection for rps11, rps15, accD, clpP and ycf1 genes in the plastid genomes of the studied Astragalus species. The rps gene family are involved in self-replication, accD gene encodes the beta-carboxyl transferase subunit of acetyl-CoA carboxylase which is necessary for plant development, clpP gene, as mentioned earlier, codes a caseinolytic peptidase and contains two introns (Intron 1 of clpP gene has been lost across the IRLC and also the second clpP intron was absent in some Neo-Astragalus, Glycyrrhiza and Tibetia species [27,66]) and ycf1 gene encodes a protein with approximately 1,800 amino acids and is essential for plant viability [76]. In a study [28], it was, however, shown that cemA and rpl33 were underwent positive selection in Astragalus species. There are some regions with accelerated mutation rates in the plastome of legumes and in particular IRLC taxa which have been undergone adaptive evolution, including rps16-accD-psaI-ycf4-cemA region. In this region, rps16 gene was lost across the IRLC [16,59], ycf4 gene shows positive selection in some taxa of the tribe Fabeae (Lathyrus, Pisum and Vavilovia) [16,17] and accD gene absent in the Trifolium subgen. Trifolium [18,20,65].

Phylogenetic implications
In our study, the chloroplast-based Astragalus phylogenomics was strongly supported and consistent with previous studies [27,28,81,82]. In accordance with previous studies, IRLC is monophyletic and consists of several tribes/lineages including Wisterieae, Glycyrrhizeae, Galega (Galegeae), Cicereae, Trifolieae, Fabeae, Caraganeae, Hedysareae, Coluteoid clade and genera Astragalus and Oxytropis [12,28,81,82]. In the present study, four newly sequenced Astragalus species (A. iranicus, A. macropelmatus, A. mesoleios and A. odoratus) along with other selected species of Astragalus from GenBank, form a monophyletic clade. In many previous studies [12,83], the genus Oxytropis was retrieved as the sister to Astragalus, but in recent plastid DNA-based phylogenetic analyses [28,38,81,82] in agreement with the present study, Oxytropis united with Coluteoid clade (tribe Coluteae) which, in turn, closely related to Astragalus. Our results showed that each of the four newly sequenced species of Astragalus (A. iranicus, A. macropelmatus, A. mesoleios and A. odoratus) were placed in their respective clade, as in previous studies [83,84].
The results of our phylogenetic analysis for Astragalus species imply that complete plastid genome database can be powerful resource to construct relationships among species of this genus. The rapid development of plastome sequencing technologies has the potential to provide useful genomic information for reconstructing phylogenetic relationships at lower and higher taxonomic levels.

Conclusions
We sequenced, assembled and compared the plastid genomes of four Astragalus species in this study (A. iranicus, A. macropelmatus, A. mesoleios and A. odoratus). All these species belong to the IRLC and tribe Galegeae. The organization and gene contents of the Astragalus plastomes were detected to be well conserved, however, the A. macropelmatus plastome showed a unique inversion (13-kb) in the IR region. We also obtained such comprehensive molecular information as codon usage, distribution of SSRs and repeat sequences, prediction of RNA editing, detection of hotspot regions and phylogenomic analysis. In addition, seven hypervariable regions (accD, clpP, ycf1 and ycf2 as protein-coding regions and trnK(UUU)-rbcL, rps7-trnV (GAC) and trnR(ACG)-trnN(GUU) as intergenic regions) were detected, which might be used as molecular markers for genus/species identification. Our findings increase the data on the plastomes of Astragalus and provide useful resource for future research on population genetics, molecular phylogeny and evolution of Astragalus.
Supporting information S1